Hi, in this notebook we'll be using Hotel Booking Demand data from Kaggle. This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. We are going to visualize the data by using plotly and also predict the possibility of hotel bookings using random forest model using sklearn.
As usual, before we begin any analysis and modeling, let's import several necessary libraries to work with the data. There are two additional packages used in this notebook:
pycountry: contains mapping of ISO country, subdivision, language, currency and script definitions and their translationsppscore: implementation of the Predictive Power Score (PPS)# Data Analysis
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Interactive Plotting
import plotly.express as px
import plotly.io as pio
import plotly.offline as py
pio.templates.default = "plotly_white"
# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
from sklearn.utils import resample
import multiprocessing
import pickle
# Additional Packages
import pycountry
import pycountry_convert as pc
import ppscore as pps
from tqdm.notebook import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')
Before we jump into any visualization or modeling step, we have to make sure our data is ready.
Load the downloaded csv and inspect the structure of the data.
hotel = pd.read_csv("dataset/hotel_bookings.csv")
hotel.head()
hotel.info()
Our dataframe hotel contains 119390 rows of bookings and 32 columns with data description as follows:
hotel: Hotel (H1 = Resort Hotel or H2 = City Hotel)is_canceled: Value indicating if the booking was canceled (1) or not (0)lead_time: Number of days that elapsed between the entering date of the booking into the PMS and the arrival datearrival_date_year: Year of arrival datearrival_date_month: Month of arrival datearrival_date_week_number: Week number of year for arrival datearrival_date_day_of_month: Day of arrival datestays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotelstays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hoteladults: Number of adultschildren: Number of childrenbabies: Number of babiesmeal: Type of meal booked. Categories are presented in standard hospitality meal packages: country: Country of origin. Categories are represented in the ISO 3155–3:2013 formatmarket_segment: Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”distribution_channel: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”is_repeated_guest: Value indicating if the booking name was from a repeated guest (1) or not (0)previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current bookingprevious_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current bookingreserved_room_type: Code of room type reserved. Code is presented instead of designation for anonymity reasons.assigned_room_type: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellationdeposit_type: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: agent: ID of the travel agency that made the bookingcompany: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasonsdays_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customercustomer_type: Type of booking, assuming one of four categories:adr: Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nightsrequired_car_parking_spaces: Number of car parking spaces required by the customertotal_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor)reservation_status: Reservation last status, assuming one of three categories:reservation_status_date: Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel.Check if there are any missing values in hotel
hotel.isna().sum().sort_values(ascending = False)
There are four columns with missing values, here's how we handle them:
agent and company, since the missing values are too many and we won't use them for predictioncountrychildren with the value 0hotel.drop(columns = ['agent', 'company'], inplace = True)
hotel['country'].fillna("UNKNOWN", inplace = True)
hotel['children'].fillna(0, inplace = True)
Check whether there is another missing values.
hotel.isna().values.any()
We convert object to category data types to save memory. Also map the boolean columns is_canceled and is_repeated_guest into category for readability.
category_cols = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status']
boolean_cols = ['is_canceled', 'is_repeated_guest']
boolean_map = {0: 'No', 1: 'Yes'}
hotel['is_canceled'] = hotel['is_canceled'].map(boolean_map)
hotel['is_repeated_guest'] = hotel['is_repeated_guest'].map(boolean_map)
hotel[category_cols + boolean_cols] = hotel[category_cols + boolean_cols].astype('category')
hotel['is_canceled'].cat.reorder_categories(list(boolean_map.values()), inplace = True)
hotel['is_repeated_guest'].cat.reorder_categories(list(boolean_map.values()), inplace = True)
Convert children from float to integer.
hotel['children'].apply(float.is_integer).all()
hotel['children'] = hotel['children'].astype('int')
Convert reservation_status_date as datetime.
hotel['reservation_status_date'] = hotel['reservation_status_date'].astype('datetime64')
Here's the final data types of hotel:
hotel.dtypes
Feature engineering is the process of using domain knowledge to extract features from provided raw data. These features can be used to improve the performance of machine learning models.
Instead of considering each assigned and reserved room type, we create a new column is_assigned_as_reserved to make a flag whether the customer get their expected room type or not.
hotel['reserved_room_type'].cat.set_categories(hotel['assigned_room_type'].cat.categories, inplace = True)
hotel['is_assigned_as_reserved'] = (hotel['assigned_room_type'] == hotel['reserved_room_type']).astype('category')
hotel['is_assigned_as_reserved']
Combine arrival_date_year, arrival_date_month, arrival_date_day_of_month into one column arrival_date so that we can extract more information from the date.
arrival_date_cols = ['arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month']
hotel[arrival_date_cols] = hotel[arrival_date_cols].astype(str)
hotel['arrival_date'] = pd.to_datetime(hotel[arrival_date_cols].apply('-'.join, axis = 1), format = "%Y-%B-%d")
hotel.drop(columns = arrival_date_cols + ['arrival_date_week_number'], inplace = True)
Create booking_date by subtracting lead_time days from arrival_date.
hotel['booking_date'] = hotel['arrival_date'] - pd.to_timedelta(hotel['lead_time'], unit = 'days')
hotel[['booking_date', 'arrival_date', 'lead_time']].head()
The column country represents code of a country in the ISO 3155–3:2013 format. By utilizing the code-to-name mapping provided in pycountry package, we can extract it into country_name and continent_name.
additional_code2name = {'TMP': 'East Timor'}
def convertCountryCode2Name(code):
country_name = None
try:
if len(code) == 2:
country_name = pycountry.countries.get(alpha_2 = code).name
elif len(code) == 3:
country_name = pycountry.countries.get(alpha_3 = code).name
except:
if code in additional_code2name.keys():
country_name = additional_code2name[code]
return country_name if country_name is not None else code
hotel['country_name'] = hotel['country'].apply(convertCountryCode2Name).astype('category')
hotel['country_name'].head()
additional_name2continent = {'East Timor': 'Asia', 'United States Minor Outlying Islands': 'North America', 'French Southern Territories': 'Antarctica', 'Antarctica': 'Antarctica'}
def convertCountryName2Continent(country_name):
continent_name = None
try:
alpha2 = pc.country_name_to_country_alpha2(country_name)
continent_code = pc.country_alpha2_to_continent_code(alpha2)
continent_name = pc.convert_continent_code_to_continent_name(continent_code)
except:
if country_name in additional_name2continent.keys():
continent_name = additional_name2continent[country_name]
else:
continent_name = "UNKNOWN"
return continent_name if continent_name is not None else country_name
hotel['continent_name'] = hotel['country_name'].apply(convertCountryName2Continent).astype('category')
hotel['continent_name'].head()
There are some hidden anomalies present in the hotel bookings. Let's create new variables and plot their frequencies:
total_guest: Total number of adults, children, and babiestotal_nights: Number of nights the guest stayed at the hotel, sum of stays_in_weekend_nights and stays_in_week_nightshotel['total_guest'] = hotel[['adults', 'children', 'babies']].sum(axis = 1)
hotel['total_nights'] = hotel[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis = 1)
data2plot = [hotel['total_guest'].value_counts().sort_index(ascending = False),
hotel['total_nights'].value_counts().sort_index(ascending = False)[-21:]]
ylabs = ["Total Guest per Booking (Person)", "Total Nights per Booking"]
titles = ["FREQUENCY OF TOTAL GUEST PER BOOKING\n", "FREQUENCY OF TOTAL NIGHTS PER BOOKING\n(UP TO 20 NIGHTS ONLY)"]
fig, axes = plt.subplots(1, 2, figsize = (15, 5))
for ax, data, ylab, title in zip(axes, data2plot, ylabs, titles):
bp = data.plot(kind = 'barh', rot = 0, ax = ax)
for rect in bp.patches:
height = rect.get_height()
width = rect.get_width()
bp.text(rect.get_x() + width,
rect.get_y() + height/2,
int(width),
ha = 'left',
va = 'center',
fontsize = 8)
bp.set_xlabel("Frequency")
bp.set_ylabel(ylab)
ax.set_title(title, fontweight = "bold")
There are 180 bookings without guest (total_guest = 0) and 715 bookings with zero nights of staying at the hotel (total_nights = 0). Ideally, such cases should not occur on our bookings data. Therefore, from this point onwards we will ignore the observations with either cases since it can affect our modeling outcome.
hotel = hotel[(hotel['total_guest'] != 0) & (hotel['total_nights'] != 0)]
hotel.shape
We end up with 118565 rows of bookings, originally it was 119390 rows.
Before we jump into the modeling step, it is recommended to visualize the data to better understand our data.
df_cancel_status = pd.crosstab(index = hotel.is_canceled,
columns = hotel.reservation_status,
margins = True)
ax = df_cancel_status.iloc[:-1,:-1].plot(kind = 'bar', stacked = True, rot = 0)
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
if height != 0:
ax.text(rect.get_x() + width,
rect.get_y() + height/2,
int(height),
ha = 'left',
va = 'center',
color = "black",
fontsize = 10)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles = handles, labels = labels)
percent_no = (100*df_cancel_status/df_cancel_status.iloc[-1,-1]).loc["No", "All"]
ax.set_xticklabels(["No\n({:.2f} %)".format(percent_no), "Yes\n({:.2f} %)".format(100-percent_no)])
ax.set_xlabel("Canceled?")
ax.set_ylabel("Number of Bookings")
plt.title("BOOKING CANCELLATION PROPORTION", fontweight = "bold")
plt.show()
The proportion of the target variable is_canceled is somewhat balanced. There is 37.26% of the bookings which are canceled, divided into two cases:
df_choropleth = hotel.copy()
df_choropleth['booking_date_year'] = df_choropleth['booking_date'].dt.year
df_country_year_count = df_choropleth.groupby(['country', 'booking_date_year']).count()['hotel'].fillna(0).reset_index() \
.rename(columns={'country': 'country_code', 'booking_date_year': 'year', 'hotel':'count'})
df_country_year_count['country_name'] = df_country_year_count['country_code'].apply(convertCountryCode2Name)
df_country_year_count['count'] = df_country_year_count['count'].astype('int')
fig = px.choropleth(df_country_year_count[df_country_year_count["year"] != 2013],
locations = "country_code", color = "count", animation_frame = "year",
hover_name = "country_name",
range_color = (0, 5000),
color_continuous_scale = px.colors.sequential.Reds,
projection = "natural earth")
fig.update_layout(title = 'ANNUAL HOTEL BOOKING COUNTS',
template = "seaborn")
fig.show(renderer="notebook")
From the choropleth map we can see throughout the year, Europe is the continent with the most hotel booking counts. The specific country with the most bookings is Portugal (PRT).
From the previous section, we know Europe is the continent with most bookings, but how about the cancellation rate?
ax = pd.crosstab(index = hotel['continent_name'],
columns = hotel['is_canceled'],
margins = True).sort_values('All').iloc[:-1,:-1].plot(kind = 'barh')
ax.legend(bbox_to_anchor = (1, 1), title = "Canceled?")
ax.set_xlabel("Number of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("BOOKINGS BY EACH CONTINENT", fontweight = "bold")
plt.show()
ax = (pd.crosstab(index = hotel['continent_name'],
columns = hotel['is_canceled'],
normalize = 'index').sort_values('Yes') * 100).plot(kind = 'barh', stacked = True)
ax.legend(bbox_to_anchor = (1, 1), title = "Canceled?")
ax.set_xlabel("Percentage of Bookings")
ax.set_ylabel("Continent Name")
ax.set_title("PERCENTAGE OF BOOKINGS CANCELLATION BY EACH CONTINENT", fontweight = "bold")
plt.show()
It turns out that Africa has the greatest cancellation rate amongst the other continent.
df_cancellation = hotel.copy()
df_cancellation['date_period'] = df_cancellation['reservation_status_date'].dt.to_period('M')
df_cancellation_percent = df_cancellation.groupby(['date_period', 'is_canceled', 'hotel'])['hotel'].count() \
.groupby(['date_period', 'hotel']).apply(lambda x: 100*x/x.sum()) \
.unstack(level = 'is_canceled') \
.rename(columns = str).reset_index().rename_axis(None, axis = 1).rename(columns = {'hotel': 'Hotel Type'})
df_cancellation_percent['date_period'] = df_cancellation_percent['date_period'].values.astype('datetime64[M]')
fig = px.line(df_cancellation_percent, x = 'date_period', y = 'Yes', color = 'Hotel Type')
fig.update_traces(mode = "markers+lines",
hovertemplate = "Rate: %{y:.2f}%")
fig.update_layout(title = 'CANCELLATION RATE OVER TIME BY HOTEL TYPE',
xaxis_title = 'Cancellation Period',
yaxis_title = 'Cancellation Rate (%)',
hovermode = 'x',
template = "seaborn",
xaxis = dict(tickformat="%b %Y"))
fig.show(renderer="notebook")
Most of the time, City hotel has greater cancellation rate than Resort hotel. The good news is that: both rate are decreasing towards zero during mid 2017, meaning none of the booking was cancelled. But we have to anticipate during the beginning of year 2018, since for the past three years there are peaks on the cancellation rate during January. Therefore in the next section, we can use a Machine Learning model to predict whether certain bookings will be canceled by the customer or not from predictor variables present in our data.
According to Florian Wetschoreck, PPS is an asymmetric and data-type-agnostic score that can detect linear or non-linear relationships between two columns. One column acts as an univariate predictor, whereas the other acts as the target variable. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation matrix.
The PPS is calculated as follows:
$PPS = \dfrac{F1_{model} - F1_{naive}}{1 - F1_{naive}}$
where:
sklearn.DecisionTreeClassifier.Detailed explanation is available here.
Before we investigate the PPS, we do the following:
booking_date, reservation_status_date, and arrival_date.assigned_room_type and reserved_room_type because the levels are quite many, instead is_assigned_as_reserved will be considered.country and country_name because the levels are too many, instead continent_name will be considered.datetime_cols = ['booking_date', 'reservation_status_date', 'arrival_date']
for col in datetime_cols:
hotel[f"{col}_dayofyear"] = hotel[col].dt.dayofyear
ignore_cols = ['assigned_room_type', 'reserved_room_type', 'country', 'country_name']
hotel_pps_data = hotel.drop(datetime_cols + ignore_cols, axis = 1)
hotel_pps_dummy = pd.get_dummies(hotel_pps_data)
hotel_pps_dummy.head()
We treat each columns of hotel_pps_dummy as a univariate predictors of is_canceled, then calculate the PPS and present the result in the form of DataFrame.
pps_score = []
target = 'is_canceled_Yes'
for col in hotel_pps_dummy.columns:
if col == target:
continue
d = {}
d['feature'] = col
d['dtypes'] = 'categorical' if hotel_pps_dummy[col].dtypes == 'uint8' else 'numerical'
d['pps'] = pps.score(hotel_pps_dummy, x = col, y = target, task = 'classification')['ppscore']
pps_score.append(d)
hotel_pps = pd.DataFrame(pps_score).set_index('feature')
hotel_pps.head()
ax = hotel_pps[hotel_pps['dtypes'] == 'categorical'].sort_values('pps')[:-1]\
.plot(kind = 'barh', legend = False, figsize = (5, 12))
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
ax.text(rect.get_x() + width,
rect.get_y() + height,
round(width, 5),
ha = 'left',
va = 'top',
fontsize = 8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("CATEGORICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled", fontweight = "bold")
plt.show()
From PPS of the categorical variables, we have to ignore reservation_status for the modeling. The score is high because from the business perspective, this value is actually is_canceled but breakdown into three specific categories. So we cannot use this as a predictor.
ax = hotel_pps[hotel_pps['dtypes'] == 'numerical'].sort_values('pps')\
.plot(kind = 'barh', legend = False, figsize = (5, 5))
for rect in ax.patches:
height = rect.get_height()
width = rect.get_width()
ax.text(rect.get_x() + width,
rect.get_y() + height/2,
round(width, 5),
ha = 'left',
va = 'center',
fontsize = 8)
ax.set_xlabel("PPS")
ax.set_ylabel("Predictor Variable")
plt.title("NUMERICAL PREDICTORS PREDICTIVE POWER SCORE\n TARGET: is_canceled", fontweight = "bold")
plt.show()
From PPS of the numerical variables, we have ignore reservation_status_date_dayofyear the reason is the same as before when we ignore reservation_status. We can ignore stay_in_week_nights and stay_in_weekend_nights because already explained by total_nights. Also, adults, children, and babies are explained by total_guest.
Now, let's proceed to the modeling step using Random Forest Classifier by sklearn.
PPS score from the previous section help us to identify which column to be ignored. We are going to drop four columns before we fit the data into Random Forest model.
ignore_cols_for_model = ['reservation_status', 'reservation_status_date_dayofyear',
'total_nights', 'total_guest']
hotel_model = hotel_pps_data.drop(columns = ignore_cols_for_model)
We split our dataset into 80% train and 20% test dataset for model evaluation.
hotel_model_dummy = pd.get_dummies(hotel_model, drop_first = True)
X = hotel_model_dummy.drop(['is_canceled_Yes'], axis = 1)
y = hotel_model_dummy['is_canceled_Yes']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.80, shuffle = True, random_state = 333)
print(f"Predictors: {X_train.shape[1]}")
print(f"Training dataset: {y_train.shape[0]} observations")
print(f"Testing dataset: {y_test.shape[0]} observations")
First, we fit the train dataset into RandomForestClassifier without tuning any parameter to know the starting performance of our model as a benchmark.
model_default = RandomForestClassifier(n_jobs = multiprocessing.cpu_count()-1, oob_score = True, random_state = 333, verbose = 1)
model_default.fit(X_train, y_train)
Then we evaluate the model_default by printing out classification metrics such as:
def clfReport(y_true, y_pred):
report = classification_report(y_true, y_pred, output_dict = True,
target_names = ["Not Canceled", "Canceled"])
return pd.DataFrame(report).transpose()
def clfReport_default(model, X, y):
clfreport = clfReport(y, model.predict(X))
print(f'Accuracy: {clfreport.loc["accuracy", "support"]}')
return clfreport.iloc[:2, :-1]
clfReport_default(model_default, X_test, y_test)
Let's focus on recall because in this case, we want to minimize False Negative, where a canceled booking is predicted as not canceled. Suppose the marketing team of this hotel only offers follow-up promotions to customers who are predicted as canceled. If the recall is too low, then the potential customer who will cancel their booking won't be contacted.
From the report above, we conclude that model_default bias towards "Not Canceled" (negative class) because the recall is much higher than "Canceled" (positive class) - a gap of 16.5%. The cause of this may be because the target variable proportion is not equally balanced, the original ratio is 63:37. In the next section, we'll perform sampling to overcome this problem.
We can do either upsampling or downsampling to balance the positive and negative class of is_canceled into a 50:50 ratio.
In this case, we prefer to do downsampling since we still have thousands of observations, which are sufficient for modeling. But we only do downsampling on the training dataset, whereas still retain the ratio of testing data.
def sampleData(data, target_col, method = "down"):
freq = data[target_col].value_counts()
class_majority, class_minority = freq.idxmax(), freq.idxmin()
data_majority, data_minority = data[data[target_col] == class_majority], data[data[target_col] == class_minority]
if method == "down":
data_hold = data_minority
data_sample = data_majority
replacement = False
elif method == "up":
data_hold = data_majority
data_sample = data_minority
replacement = True
data_sampled = resample(data_sample,
n_samples = data_hold.shape[0],
replace = replacement,
random_state = 333)
return pd.concat([data_sampled, data_hold])
hotel_model_train_down = sampleData(pd.concat([X_train, y_train], axis = 1), "is_canceled_Yes", "down")
X_train_down = hotel_model_train_down.drop(['is_canceled_Yes'], axis = 1)
y_train_down = hotel_model_train_down['is_canceled_Yes']
Here is the proportion of target variable before and after downsampling.
pd.concat([y_train.value_counts().rename("Before Sampling"),
y_train_down.value_counts().rename("After Sampling")],
axis = 1).set_index(pd.Series(["Not Canceled", "Canceled"]))
We fit the downsampled train dataset again into RandomForestClassifier and compare it with model_default.
model_default_down = RandomForestClassifier(n_jobs = multiprocessing.cpu_count()-1, oob_score = True, random_state = 333, verbose = 1)
model_default_down.fit(X_train_down, y_train_down)
clfReport_default(model_default_down, X_test, y_test)
Good, now the recall gap is not too huge, only 4%. Next, let's do train-test evaluation to check whether the model is a good fit or not.
def trainTestEval(model, pos_class, X_train, X_test, y_train, y_test):
report_train = clfReport(y_train, model.predict(X_train))
report_test = clfReport(y_test, model.predict(X_test))
df = pd.concat([report_train.loc[pos_class, :].rename("Train"),
report_test.loc[pos_class, :].rename("Test")], axis = 1).transpose()
df.insert(loc = 0, column = "accuracy",
value = [report_train.loc["accuracy", "support"],
report_test.loc["accuracy", "support"]])
df["observation"] = np.array([X_train.shape[0], X_test.shape[0]])
df["pos_proportion"] = df["support"]/df["observation"]
return df.drop(columns = "support")
trainTestEval(model_default_down, "Canceled", X_train_down, X_test, y_train_down, y_test)
From the evaluation report above, we can conclude that model_default_down is overfitted the data, since the performance on the training dataset is nearly perfect (all metrics reach 99%) but not so good on the test dataset. To overcome this, we are going to use Random Search Cross-Validation which is one technique to do hyperparameter tuning.
This technique narrows down our search by evaluating a wide range of values for each parameter. Using the sklearn RandomizedSearchCV method, we define a grid of hyperparameter ranges random_grid and as the name suggests, it randomly samples from the grid, performing k-fold cross-validation with each combination of values.
random_grid = {
'n_estimators': [x for x in range(50, 1000, 50)],
'criterion': ['gini', 'entropy'],
'max_depth': [x for x in range(10, 100, 5)],
'min_samples_split': [x for x in range(10, 50, 5)],
'min_samples_leaf': [x for x in range(10, 50, 5)],
'max_features': ['sqrt', 'log2', None]
}
# number of possible combinations
mul = 1
for key in random_grid.keys():
mul *= len(random_grid[key])
mul
There are 131,328 possible combinations of the parameter from random_grid, we decided to build models by using only 500 combinations. Using 3-fold cross-validation means that each combination of parameters will be evaluated 3 times with a 67-33 train-test proportion. The scoring uses "recall" because in this case, we want to minimize False Negative, where a canceled booking is predicted as not canceled. Suppose the marketing team of this hotel only offers follow-up promotions to customers who are predicted as canceled. If the recall is too low, then the potential customer who will cancel their booking won't be contacted. Anyway, the Random Search process takes a while, around 5-6 hours to be specific, since we train them on CPU - maybe should consider using GPU.
model_base = RandomForestClassifier(oob_score = True,
warm_start = True,
n_jobs = multiprocessing.cpu_count()-1,
random_state = 333)
rf_random = RandomizedSearchCV(estimator = model_base,
param_distributions = random_grid,
scoring = 'recall',
n_iter = 500,
cv = 3,
verbose = 1,
n_jobs = multiprocessing.cpu_count()-1,
random_state = 333)
rf_random.fit(X_train_down, y_train_down)
We save the Random Search result to pickle.
def saveModel(model, pickle_name):
with open(pickle_name, 'wb') as f:
pickle.dump(model, f)
saveModel(rf_random, "rf_random_500.pkl")
Let's take a look of the result from best to worst in terms of Recall and print one combination of the parameter which yield the best recall.
pd.DataFrame(rf_random.cv_results_).sort_values('rank_test_score')
rf_random.best_params_
Next, we fit RandomForestClassifier using the best parameters.
model_tuned = RandomForestClassifier(**rf_random.best_params_, verbose = 1,
n_jobs = multiprocessing.cpu_count()-1, random_state = 333)
model_tuned.fit(X_train_down, y_train_down)
Just like we did on model_default, let's evaluate whether our model is still overfit or not:
model_tuned_metrics = trainTestEval(model_tuned, "Canceled", X_train_down, X_test, y_train_down, y_test)
model_tuned_metrics
It turns out that model_tuned is not too overfitted compare to model_Default. Considering the difference of recall between train and test is only 6%, before it was a 16% difference.
We can further tune model_tuned by changing the threshold when classifying the probabilities to a class. So far, the model use default threshold which is 0.5, means that if a booking is predicted to have more than 50% chance of cancellation, then it is classified as Canceled, otherwise Not Canceled. We create tuningThresholdPlot function to iteratively move the threshold from 0 to 1 and plot the classification metrics.
def tuningThresholdPlot(model, X, y):
tuning_list = []
y_pred_prob = model.predict_proba(X)[:,1]
for threshold in np.linspace(0, 1, 101)[:-1]:
y_pred_class = y_pred_prob > threshold
report = clfReport(y, y_pred_class)
tuning_res = {"threshold" : threshold,
"accuracy": report.loc["accuracy"][-1]}
tuning_res = {**tuning_res, **report.loc["Canceled"].to_dict()} # append dicts
tuning_res.pop("support", None)
tuning_list.append(tuning_res)
tuning_df = pd.DataFrame(tuning_list)
# plotting
ax = tuning_df.plot(x = "threshold", color = "rgyb", lw = 3, figsize = (10, 6))
# vertical line for center
diff = abs(tuning_df["recall"] - tuning_df["precision"])
thresh_center = tuning_df[diff == min(diff)]["threshold"].values[0]
ax.axvline(x = thresh_center, ls = '--', color = "k")
ax.text(x = thresh_center + 0.01, y = 1,
s = f"CENTER", fontsize = 12, color = "k")
ax.set_xticks(list(ax.get_xticks()[1:-1]) + [thresh_center])
ax.legend(loc = "upper center", bbox_to_anchor = (0.5, -0.15), shadow = True, ncol = 4)
ax.set_ylabel("Metric")
plt.title("TUNING THRESHOLD", size = 15, fontweight = "bold")
plt.show()
return tuning_df
tuning_thresh = tuningThresholdPlot(model_tuned, X_test, y_test)
The vertical line labeled "CENTER" is when precision equal to recall. As mentioned earlier, we do care about the recall, but on second priority, we have to take into account about the precision too so that our model can correctly classify bookings that are not canceled but predicted as canceled (False Positives). We want to minimize this case in order to reduce the risk of overbooking. In conclusion, we will move the threshold to be less than 0.53 (CENTER) to achieve higher recall, but still maintaining a good precision. Let's compare the center to the default threshold, which is 0.5.
DEFAULT_THRESHOLD = 0.5
CENTER_THRESHOLD = 0.53
compare_thresh = tuning_thresh[tuning_thresh["threshold"].isin([DEFAULT_THRESHOLD, CENTER_THRESHOLD])]
compare_thresh
Let's calculate gain/loss of the metrics if we use CENTER_THRESHOLD instead of DEFAULT_THRESHOLD
compare_thresh.iloc[1,:] - compare_thresh.iloc[0,:]
In conclusion, we prefer to use DEFAULT_THRESHOLD instead of CENTER_THRESHOLD in order to maintain recall value.
It is a probability curve which plots True Positive Rate (Recall) against False Positive Rate. The area under the ROC curve, called as Area Under Curve (AUC), measures the degree of classification separability. Higher the AUC, better the model is at distinguishing between canceled and not canceled bookings.
def rocauc(y_pred_list, y_test):
y_pred_list = [(np.zeros(len(y_test)), "BASELINE")] + y_pred_list
for y_pred, label in y_pred_list:
fpr, tpr, _ = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
plt.plot(fpr, tpr, label = f"{label}\n(AUC: {auc:.3f})",
linestyle = '--' if label == "BASELINE" else '-')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE", fontweight = "bold")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()
rocauc([(model_default.predict(X_test), "DEFAULT"),
(model_tuned.predict(X_test), "THRESHOLD = 0.5"),
(model_tuned.predict_proba(X_test)[:,1] > CENTER_THRESHOLD, f"THRESHOLD = {CENTER_THRESHOLD}")], y_test)
The AUC of using DEFAULT_THRESHOLD or CENTER_THRESHOLD is the same, indicating the same measure of separability while we retain the recall performance if we use DEFAULT_THRESHOLD.
This refers to a technique that assign a score to features based on how useful they are at predicting a target variable. In this case, we want to know which feature is the most important in predicting hotel booking cancellation using sklearn built-in attributes feature_importances_.
def featureImportancesPlot(features, importances, n_features = 20):
feature_imp_df = pd.DataFrame({'Feature': features,
'Importance': importances}).set_index('Feature')\
.sort_values('Importance').tail(n_features)
ax = feature_imp_df.plot(kind = 'barh', legend = False,
title = f"Top {n_features} Feature Importances")
ax.set_xlabel("Relative Importance")
plt.show()
return feature_imp_df.sort_values('Importance', ascending = False)
feature_imp = featureImportancesPlot(X.columns, model_tuned.feature_importances_)
The top three features based on their importance are:
deposit_type_Non Refund: the customer made a deposit to guarantee the booking in the value of the total stay costlead_time: number of days that elapsed between the entering date of the booking into the PMS and the arrival dateadr: average daily rateWe successfully visualized the hotel booking data using plotly to gain insights about the data and modeling using Random Forest to predict the cancellation of the booking based on various predictors. Before modeling, we inspect the PPS score to do feature selection manually. Next, we tune the base model using RandomizedSearchCV to prevent overfitting, followed by threshold tuning checking. In the end, we achieve a quite good metrics:
There are several things need to be improved:
ccp_alpha to cut the tree with cost complexity pruning.cuML packages (but for now not available on Windows OS).